In this notebook, I analyze Google Play Store data using Python. This is my first data analysis study.
Google Play Store apps and reviews
Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.
Let's take a look at the data, which consists of two files:
Before jumping into the data provided, let me first explain exploratory data analysis (EDA).
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate datasets: to find patterns and anomalies (outliers), to form hypotheses based on an understanding of the data, and to summarize the data's main characteristics, often with data visualization methods. It is an important step in any data analysis or data science project, because it helps determine how best to manipulate data sources to get the answers you need.
EDA involves generating summary statistics for the numerical data in the dataset and creating various graphical representations to understand the data better and present it more clearly.
The following are the various steps involved in the EDA process:
#import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns # visualization tool
from datetime import datetime
# plotly
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import warnings
#sns.set(font_scale=1.5)
warnings.filterwarnings("ignore")
# loading csv File
ps_df=pd.read_csv(r"E:\0001Almabetter\2.numerical python programming\project-eda-numeric-python\Play Store Data.csv")
# Display the Play Store App data head
ps_df.head()
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
ps_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             10841 non-null  object
 1   Category        10841 non-null  object
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object
 4   Size            10841 non-null  object
 5   Installs        10841 non-null  object
 6   Type            10840 non-null  object
 7   Price           10841 non-null  object
 8   Content Rating  10840 non-null  object
 9   Genres          10841 non-null  object
 10  Last Updated    10841 non-null  object
 11  Current Ver     10833 non-null  object
 12  Android Ver     10838 non-null  object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
# Listing the column names of the given dataset
print(ps_df.columns)
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
'Android Ver'],
dtype='object')
ps_df.shape
(10841, 13)
ps_df.dtypes
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
ps_df.describe()
| Rating | |
|---|---|
| count | 9367.000000 |
| mean | 4.193338 |
| std | 0.537431 |
| min | 1.000000 |
| 25% | 4.000000 |
| 50% | 4.300000 |
| 75% | 4.500000 |
| max | 19.000000 |
Let us first define what information the columns contain based on our inspection.
play_store dataframe has 10841 rows and 13 columns. The 13 columns are identified as below:
The three features that we will work with most frequently from here on are Installs, Size, and Price. A careful look at the dataset reveals that these columns need cleaning before they can be consumed by the code we'll write later. Specifically, the presence of special characters (, $ +) and letters (M k) in the Installs, Size, and Price columns makes their conversion to a numerical data type difficult. Let's clean them by removing these characters and converting each column to a numeric type.
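As a rough illustration of the cleaning idea described above, the stray characters can be stripped with vectorized string methods and the result parsed with `pd.to_numeric`. This is only a sketch on toy values; the `raw`, `installs`, and `price` names are assumptions made for the example and are not part of the notebook's dataframe.

```python
import pandas as pd

# Toy values that mimic the raw string columns (not taken from the real file).
raw = pd.DataFrame({
    "Installs": ["10,000+", "500+", "0"],
    "Price": ["0", "$4.99", "$0.99"],
})

# Remove '+' and ',' with a regular expression, then let pandas parse the digits.
installs = pd.to_numeric(raw["Installs"].str.replace(r"[+,]", "", regex=True))

# '$' is special in regular expressions, so it is removed as a literal string.
price = pd.to_numeric(raw["Price"].str.replace("$", "", regex=False))

print(installs.tolist())  # [10000, 500, 0]
print(price.tolist())     # [0.0, 4.99, 0.99]
```

The per-value helper functions used later in this notebook reach the same result one entry at a time; the vectorized form is simply shorter.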
Removing the NaN values and duplicates present in the dataset
# This user-defined function reports, for each column, the datatype, the counts of non-null and null values, the null ratio, and the number of unique values
def playstoreinfo():
    temp = pd.DataFrame(index=ps_df.columns)
    temp["datatype"] = ps_df.dtypes
    temp["not null values"] = ps_df.count()
    temp["null value"] = ps_df.isnull().sum()
    temp["% of the null value"] = ps_df.isnull().mean()
    temp["unique count"] = ps_df.nunique()
    return temp
playstoreinfo()
| datatype | not null values | null value | % of the null value | unique count | |
|---|---|---|---|---|---|
| App | object | 10841 | 0 | 0.000000 | 9660 |
| Category | object | 10841 | 0 | 0.000000 | 34 |
| Rating | float64 | 9367 | 1474 | 0.135965 | 40 |
| Reviews | object | 10841 | 0 | 0.000000 | 6002 |
| Size | object | 10841 | 0 | 0.000000 | 462 |
| Installs | object | 10841 | 0 | 0.000000 | 22 |
| Type | object | 10840 | 1 | 0.000092 | 3 |
| Price | object | 10841 | 0 | 0.000000 | 93 |
| Content Rating | object | 10840 | 1 | 0.000092 | 6 |
| Genres | object | 10841 | 0 | 0.000000 | 120 |
| Last Updated | object | 10841 | 0 | 0.000000 | 1378 |
| Current Ver | object | 10833 | 8 | 0.000738 | 2832 |
| Android Ver | object | 10838 | 3 | 0.000277 | 33 |
Findings
The columns with null values are:
Rating has 1474 null values, which is 13.60% of the data. Type has 1 null value (0.01%). Content Rating has 1 null value (0.01%). Current Ver has 8 null values (0.07%). Android Ver has 3 null values (0.03%).
Let's first deal with the columns that contain fewer NaN values. For each of them, we must either find a way to replace the NaN values with non-NaN values or find a reason for the values being NaN.
1). Android Ver: There are a total of 3 NaN values in this column.
# The rows containing NaN values in the Android Ver column
ps_df[ps_df["Android Ver"].isnull()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4453 | [substratum] Vacuum: P | PERSONALIZATION | 4.4 | 230 | 11M | 1,000+ | Paid | $1.49 | Everyone | Personalization | July 20, 2018 | 4.4 | NaN |
| 4490 | Pi Dark [substratum] | PERSONALIZATION | 4.5 | 189 | 2.1M | 10,000+ | Free | 0 | Everyone | Personalization | March 27, 2018 | 1.1 | NaN |
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
# Finding the different values the 'Android Ver' column takes
ps_df["Android Ver"].value_counts()
4.1 and up            2451
4.0.3 and up          1501
4.0 and up            1375
Varies with device    1362
4.4 and up             980
2.3 and up             652
5.0 and up             601
4.2 and up             394
2.3.3 and up           281
2.2 and up             244
4.3 and up             243
3.0 and up             241
2.1 and up             134
1.6 and up             116
6.0 and up              60
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              24
1.5 and up              20
4.4W and up             12
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
4.0.3 - 7.1.1            2
5.0 - 8.0                2
1.0 and up               2
7.0 - 7.1.1              1
4.1 - 7.1.1              1
5.0 - 6.0                1
2.2 - 7.1.1              1
5.0 - 7.1.1              1
Name: Android Ver, dtype: int64
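Since the NaN values in the Android Ver column cannot be replaced by any particular value, and since only 3 rows contain NaN in this column (less than 0.03% of the rows in the dataset), these rows can be dropped. Note that row 10472 has its values shifted one column to the left (its Category is 1.9 and its Rating is 19.0), so dropping it also removes the impossible maximum rating of 19 reported by describe().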
ps_df.shape
(10841, 13)
# dropping the rows that contain NaN values in the 'Android Ver' column.
ps_df =ps_df[ps_df['Android Ver'].notna()]
# Shape of the updated dataframe
ps_df.shape
(10838, 13)
We were successfully able to handle the NaN values in the Android Ver column.
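The boolean-mask filter used for this step has a direct equivalent in `DataFrame.dropna` with the `subset` argument. The sketch below compares the two on a toy frame; the `toy`, `by_mask`, and `by_dropna` names are illustrative assumptions, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing 'Android Ver' entry (illustrative values).
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Android Ver": ["4.1 and up", np.nan, "5.0 and up"],
})

# Option 1: keep the rows where the column is not NaN (the approach used here).
by_mask = toy[toy["Android Ver"].notna()]

# Option 2: drop the rows where the listed columns are NaN.
by_dropna = toy.dropna(subset=["Android Ver"])

print(by_mask.equals(by_dropna))  # True
print(len(by_dropna))             # 2
```

Either form returns a new dataframe and keeps the original row labels, which is why the index of the cleaned data has gaps afterwards.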
2). Current Ver: There are a total of 8 NaN values in this column.
# The rows containing NaN values in the Current Ver column
ps_df[ps_df["Current Ver"].isnull()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Learn To Draw Kawaii Characters | ART_AND_DESIGN | 3.2 | 55 | 2.7M | 5,000+ | Free | 0 | Everyone | Art & Design | June 6, 2018 | NaN | 4.2 and up |
| 1553 | Market Update Helper | LIBRARIES_AND_DEMO | 4.1 | 20145 | 11k | 1,000,000+ | Free | 0 | Everyone | Libraries & Demo | February 12, 2013 | NaN | 1.5 and up |
| 6322 | Virtual DJ Sound Mixer | TOOLS | 4.2 | 4010 | 8.7M | 500,000+ | Free | 0 | Everyone | Tools | May 10, 2017 | NaN | 4.0 and up |
| 6803 | BT Master | FAMILY | NaN | 0 | 222k | 100+ | Free | 0 | Everyone | Education | November 6, 2016 | NaN | 1.6 and up |
| 7333 | Dots puzzle | FAMILY | 4.0 | 179 | 14M | 50,000+ | Paid | $0.99 | Everyone | Puzzle | April 18, 2018 | NaN | 4.0 and up |
| 7407 | Calculate My IQ | FAMILY | NaN | 44 | 7.2M | 10,000+ | Free | 0 | Everyone | Entertainment | April 3, 2017 | NaN | 2.3 and up |
| 7730 | UFO-CQ | TOOLS | NaN | 1 | 237k | 10+ | Paid | $0.99 | Everyone | Tools | July 4, 2016 | NaN | 2.0 and up |
| 10342 | La Fe de Jesus | BOOKS_AND_REFERENCE | NaN | 8 | 658k | 1,000+ | Free | 0 | Everyone | Books & Reference | January 31, 2017 | NaN | 3.0 and up |
# Finding the different values the 'Current Ver' column takes
ps_df['Current Ver'].value_counts()
Varies with device 1459
1.0 809
1.1 263
1.2 178
2.0 151
...
5.44.1 1
7.16.8 1
04.08.00 1
2.10.06 1
2.0.148.0 1
Name: Current Ver, Length: 2831, dtype: int64
Since only 8 rows contain NaN values in the Current Ver column (around 0.07% of the rows in the dataset), and there is no particular value with which we can replace them, these rows can be dropped.
# dropping rows corresponding to the values which contain NaN in the column 'Current Ver'.
ps_df=ps_df[ps_df["Current Ver"].notna()]
# Shape of the updated dataframe
ps_df.shape
(10830, 13)
3). Type: There is only one NaN value in this column.
# The row containing the NaN value in the Type column
ps_df[ps_df["Type"].isnull()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9148 | Command & Conquer: Rivals | FAMILY | NaN | 0 | Varies with device | 0 | NaN | 0 | Everyone 10+ | Strategy | June 28, 2018 | Varies with device | Varies with device |
# Finding the different values the 'Type' column takes
ps_df["Type"].value_counts()
Free    10032
Paid      797
Name: Type, dtype: int64
The Type column contains only two distinct entries, namely Free and Paid. If an app is paid, its price appears in the corresponding Price column; otherwise the price is shown as '0'. In this case, the price of the app is '0', which means the app is free. Hence we can replace this NaN value with Free.
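The rule described above (a price of '0' implies a free app) can also be applied programmatically instead of by row index. The following is a minimal sketch on toy data, assuming the raw string form of the Price column; the `toy` frame and the `inferred` name are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values) with one missing Type.
toy = pd.DataFrame({
    "Type": ["Paid", np.nan, "Free"],
    "Price": ["$1.49", "0", "0"],
})

# Infer a label from the price: '0' means Free, anything else means Paid.
inferred = pd.Series(np.where(toy["Price"] == "0", "Free", "Paid"), index=toy.index)

# Only the missing entries are filled; existing labels are left untouched.
toy["Type"] = toy["Type"].fillna(inferred)

print(toy["Type"].tolist())  # ['Paid', 'Free', 'Free']
```

With a single missing value, assigning by row label is just as good; the rule-based fill mainly helps when there are many such rows.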
# Replacing the NaN value in 'Type' column corresponding to row index 9148 with 'Free'
ps_df.loc[9148,'Type']='Free'
ps_df[ps_df['Type'].isnull()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|
4). Rating: This column contains 1470 NaN values.
# The rows containing NaN values in the Rating column
ps_df[ps_df['Rating'].isnull()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Mcqueen Coloring pages | ART_AND_DESIGN | NaN | 61 | 7.0M | 100,000+ | Free | 0 | Everyone | Art & Design;Action & Adventure | March 7, 2018 | 1.0.0 | 4.1 and up |
| 113 | Wrinkles and rejuvenation | BEAUTY | NaN | 182 | 5.7M | 100,000+ | Free | 0 | Everyone 10+ | Beauty | September 20, 2017 | 8.0 | 3.0 and up |
| 123 | Manicure - nail design | BEAUTY | NaN | 119 | 3.7M | 50,000+ | Free | 0 | Everyone | Beauty | July 23, 2018 | 1.3 | 4.1 and up |
| 126 | Skin Care and Natural Beauty | BEAUTY | NaN | 654 | 7.4M | 100,000+ | Free | 0 | Teen | Beauty | July 17, 2018 | 1.15 | 4.1 and up |
| 129 | Secrets of beauty, youth and health | BEAUTY | NaN | 77 | 2.9M | 10,000+ | Free | 0 | Mature 17+ | Beauty | August 8, 2017 | 2.0 | 2.3 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10824 | Cardio-FR | MEDICAL | NaN | 67 | 82M | 10,000+ | Free | 0 | Everyone | Medical | July 31, 2018 | 2.2.2 | 4.4 and up |
| 10825 | Naruto & Boruto FR | SOCIAL | NaN | 7 | 7.7M | 100+ | Free | 0 | Teen | Social | February 2, 2018 | 1.0 | 4.0 and up |
| 10831 | payermonstationnement.fr | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 | 2.0.148.0 | 4.0 and up |
| 10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up |
| 10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up |
1470 rows × 13 columns
ps_df['Rating'].max()
5.0
Also, we know that the rating of any app in the Play Store lies between 1 and 5. Let's check whether there are any ratings outside this range.
ps_df[(ps_df['Rating'] <1) | (ps_df['Rating']>5)]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|
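The same range check can be written with `Series.between`, which is inclusive on both ends by default. One detail worth knowing is that `between` returns False for NaN, so negating it would flag missing ratings as out of range unless they are excluded explicitly. A small sketch on toy ratings (the `ratings` and `out_of_range` names are assumptions for the example):

```python
import numpy as np
import pandas as pd

# Toy ratings: one missing value and one impossible value.
ratings = pd.Series([4.1, np.nan, 19.0, 1.0, 5.0])

# Flag values that are present but fall outside the closed interval [1, 5].
out_of_range = ratings.notna() & ~ratings.between(1, 5)

print(ratings[out_of_range].tolist())  # [19.0]
```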
The Rating column contains 1470 NaN values, which account for approximately 13.6% of the rows in the dataset. It is not practical to drop these rows because doing so would lose a large amount of data, which may affect the quality of the final analysis.
ps_df['Rating'].mean()
4.191837606837612
ps_df['Rating'].median()
4.3
ps_df
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up |
| 10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up |
| 10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up |
| 10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device |
| 10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device |
10830 rows × 13 columns
Visualizing the distribution of ratings with a histogram and detecting outliers with a box plot.
fig, ax = plt.subplots(2, 1, figsize=(12, 7))
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(ps_df['Rating'], kde=True, color='firebrick', ax=ax[0]);
sns.boxplot(x='Rating', data=ps_df, ax=ax[1]);
The mean of the ratings (excluding the NaN values) comes to about 4.2.
The median of the entries (excluding the NaN values) in the 'Rating' column is 4.3. From this we can say that roughly half of the rated apps have a rating of 4.3 or higher, and roughly half have a rating of 4.3 or lower.
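To make the choice between mean and median more concrete, the sketch below shows on toy ratings how a single low outlier pulls the mean down while the median barely moves, and that filling NaN with the median leaves the median unchanged. The values and the `ratings`, `mean`, `median`, and `filled` names are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings with two missing values and one low outlier (1.0).
ratings = pd.Series([4.0, 4.5, np.nan, 4.5, 1.0, np.nan])

# Both statistics skip NaN by default.
mean = ratings.mean()      # (4.0 + 4.5 + 4.5 + 1.0) / 4
median = ratings.median()  # middle of the sorted values 1.0, 4.0, 4.5, 4.5

# Fill the missing entries with the median and assign the result to a new Series.
filled = ratings.fillna(median)

print(mean, median)        # 3.5 4.25
print(filled.median())     # 4.25
```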
ps_df['Rating'].median()
4.3
# Replacing the NaN values in the 'Rating' column with its median value
# (assigning the result back avoids relying on an in-place fill of a column selection)
ps_df['Rating'] = ps_df['Rating'].fillna(ps_df['Rating'].median())
ps_df['Rating'].value_counts()
4.3    2546
4.4    1108
4.5    1037
4.2     951
4.6     823
4.1     707
4.0     567
4.7     499
3.9     386
3.8     303
5.0     274
3.7     239
4.8     234
3.6     174
3.5     163
3.4     128
3.3     102
4.9      87
3.0      83
3.1      69
3.2      63
2.9      45
2.8      42
2.7      25
2.6      25
2.5      21
2.3      20
2.4      19
1.0      16
2.2      14
1.9      13
2.0      12
1.7       8
1.8       8
2.1       8
1.6       4
1.4       3
1.5       3
1.2       1
Name: Rating, dtype: int64
ps_df['Rating'].isna().sum()
0
# Handling the error values in the Play store data
ps_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10830 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             10830 non-null  object
 1   Category        10830 non-null  object
 2   Rating          10830 non-null  float64
 3   Reviews         10830 non-null  object
 4   Size            10830 non-null  object
 5   Installs        10830 non-null  object
 6   Type            10830 non-null  object
 7   Price           10830 non-null  object
 8   Content Rating  10830 non-null  object
 9   Genres          10830 non-null  object
 10  Last Updated    10830 non-null  object
 11  Current Ver     10830 non-null  object
 12  Android Ver     10830 non-null  object
dtypes: float64(1), object(12)
memory usage: 1.4+ MB
ps_df['App'].value_counts()
ROBLOX 9
CBS Sports App - Scores, News, Stats & Watch Live 8
Candy Crush Saga 7
8 Ball Pool 7
ESPN 7
..
Meet U - Get Friends for Snapchat, Kik & Instagram 1
U-Report 1
U of I Community Credit Union 1
Waiting For U Launcher Theme 1
iHoroscope - 2018 Daily Horoscope & Astrology 1
Name: App, Length: 9649, dtype: int64
# Inspecting the duplicate values.
ps_df[ps_df['App']=='ROBLOX']
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1653 | ROBLOX | GAME | 4.5 | 4447388 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 1701 | ROBLOX | GAME | 4.5 | 4447346 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 1748 | ROBLOX | GAME | 4.5 | 4448791 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 1841 | ROBLOX | GAME | 4.5 | 4449882 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 1870 | ROBLOX | GAME | 4.5 | 4449910 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 2016 | ROBLOX | FAMILY | 4.5 | 4449910 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 2088 | ROBLOX | FAMILY | 4.5 | 4450855 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 2206 | ROBLOX | FAMILY | 4.5 | 4450890 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
| 4527 | ROBLOX | FAMILY | 4.5 | 4443407 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
ps_df[ps_df['App'].duplicated()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 229 | Quick PDF Scanner + OCR FREE | BUSINESS | 4.2 | 80805 | Varies with device | 5,000,000+ | Free | 0 | Everyone | Business | February 26, 2018 | Varies with device | 4.0.3 and up |
| 236 | Box | BUSINESS | 4.2 | 159872 | Varies with device | 10,000,000+ | Free | 0 | Everyone | Business | July 31, 2018 | Varies with device | Varies with device |
| 239 | Google My Business | BUSINESS | 4.4 | 70991 | Varies with device | 5,000,000+ | Free | 0 | Everyone | Business | July 24, 2018 | 2.19.0.204537701 | 4.4 and up |
| 256 | ZOOM Cloud Meetings | BUSINESS | 4.4 | 31614 | 37M | 10,000,000+ | Free | 0 | Everyone | Business | July 20, 2018 | 4.1.28165.0716 | 4.0 and up |
| 261 | join.me - Simple Meetings | BUSINESS | 4.0 | 6989 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Business | July 16, 2018 | 4.3.0.508 | 4.4 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10715 | FarmersOnly Dating | DATING | 3.0 | 1145 | 1.4M | 100,000+ | Free | 0 | Mature 17+ | Dating | February 25, 2016 | 2.2 | 4.0 and up |
| 10720 | Firefox Focus: The privacy browser | COMMUNICATION | 4.4 | 36981 | 4.0M | 1,000,000+ | Free | 0 | Everyone | Communication | July 6, 2018 | 5.2 | 5.0 and up |
| 10730 | FP Notebook | MEDICAL | 4.5 | 410 | 60M | 50,000+ | Free | 0 | Everyone | Medical | March 24, 2018 | 2.1.0.372 | 4.4 and up |
| 10753 | Slickdeals: Coupons & Shopping | SHOPPING | 4.5 | 33599 | 12M | 1,000,000+ | Free | 0 | Everyone | Shopping | July 30, 2018 | 3.9 | 4.4 and up |
| 10768 | AAFP | MEDICAL | 3.8 | 63 | 24M | 10,000+ | Free | 0 | Everyone | Medical | June 22, 2018 | 2.3.1 | 5.0 and up |
1181 rows × 13 columns
# dropping duplicates from the 'App' column.
ps_df.drop_duplicates(subset = 'App', inplace = True)
ps_df.shape
(9649, 13)
# Checking whether the duplicates in the 'App' column are taken care of or not
ps_df[ps_df['App']=='ROBLOX']
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1653 | ROBLOX | GAME | 4.5 | 4447388 | 67M | 100,000,000+ | Free | 0 | Everyone 10+ | Adventure;Action & Adventure | July 31, 2018 | 2.347.225742 | 4.1 and up |
ps_df[ps_df['App'].duplicated()]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|
We have successfully handled all the duplicate values in the App column. After dropping the duplicate rows, 9649 rows remain.
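`drop_duplicates` keeps the first occurrence of each app by default, so which duplicate survives depends on row order. If a specific duplicate should be kept instead, for example the one with the most reviews, the rows can be sorted first. This is only a sketch of that alternative on toy data (the review counts and the `toy` and `deduped` names are made up for the example), not the approach taken in this notebook.

```python
import pandas as pd

# Toy frame with one duplicated app name; Reviews is stored as text, as in the raw data.
toy = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "ESPN"],
    "Category": ["GAME", "FAMILY", "SPORTS"],
    "Reviews": ["4447388", "4450890", "521138"],
})

# Convert Reviews to numbers so that sorting is numeric rather than alphabetical.
toy["Reviews"] = pd.to_numeric(toy["Reviews"])

# Sort so the most-reviewed row of each app comes first, keep it, then restore row order.
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
       .sort_index()
)

print(deduped["Reviews"].tolist())   # [4450890, 521138]
print(deduped["Category"].tolist())  # ['FAMILY', 'SPORTS']
```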
2). Changing the datatype of the Last Updated column from string to datetime.
ps_df['Last Updated']
0 January 7, 2018
1 January 15, 2018
2 August 1, 2018
3 June 8, 2018
4 June 20, 2018
...
10836 July 25, 2017
10837 July 6, 2018
10838 January 20, 2017
10839 January 19, 2015
10840 July 25, 2018
Name: Last Updated, Length: 9649, dtype: object
# pd.to_datetime() converts the date strings in the Last Updated column into datetime64 values.
ps_df["Last Updated"] = pd.to_datetime(ps_df['Last Updated'])
ps_df.head()
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up |
ps_df['Last Updated']
0 2018-01-07
1 2018-01-15
2 2018-08-01
3 2018-06-08
4 2018-06-20
...
10836 2017-07-25
10837 2018-07-06
10838 2017-01-20
10839 2015-01-19
10840 2018-07-25
Name: Last Updated, Length: 9649, dtype: datetime64[ns]
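When the date strings follow one known layout, the format can be passed to `pd.to_datetime` explicitly, and `errors="coerce"` turns anything unparsable into NaT instead of raising. The sketch below assumes the "Month day, year" layout seen above and English month names in the active locale; the `dates` and `parsed` names are illustrative.

```python
import pandas as pd

# Two strings in the layout of the Last Updated column, plus one bad entry.
dates = pd.Series(["January 7, 2018", "August 1, 2018", "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")

print(parsed.isna().tolist())  # [False, False, True]
```

Counting the NaT values after such a conversion is a quick way to spot rows whose dates did not match the expected layout.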
3). Changing the datatype of the Price column from string to float.
ps_df['Price'].value_counts()
0 8896
$0.99 143
$2.99 124
$1.99 73
$4.99 70
...
$18.99 1
$389.99 1
$19.90 1
$1.75 1
$1.04 1
Name: Price, Length: 92, dtype: int64
To convert this column from string to float, we must first drop the $ symbol from all the values. Then we can convert those values to the float datatype.
The convert_dollar function defined below converts the values in the Price column from string datatype to float datatype. First, a quick check of the slicing on a sample string:
s = '$1.23'
if "$" in s:
    print(s[1:])
1.23
# Creating a function convert_dollar which drops the $ symbol if it is present and returns the value as a float.
def convert_dollar(val):
    if '$' in val:
        return float(val[1:])
    else:
        return float(val)
# Applying the convert_dollar function to the Price column
ps_df['Price'] = ps_df['Price'].apply(convert_dollar)
ps_df['Price'].max()
400.0
ps_df['Price'].dtype
dtype('float64')
ps_df[ps_df['Price']==0]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0.0 | Everyone | Education | 2017-07-25 | 1.48 | 4.1 and up |
| 10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0.0 | Everyone | Education | 2018-07-06 | 1.0 | 4.1 and up |
| 10838 | Parkinson Exercices FR | MEDICAL | 4.3 | 3 | 9.5M | 1,000+ | Free | 0.0 | Everyone | Medical | 2017-01-20 | 1.0 | 2.2 and up |
| 10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0.0 | Mature 17+ | Books & Reference | 2015-01-19 | Varies with device | Varies with device |
| 10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0.0 | Everyone | Lifestyle | 2018-07-25 | Varies with device | Varies with device |
8896 rows × 13 columns
ps_df[ps_df['Price']!=0]
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234 | TurboScan: scan documents and receipts in PDF | BUSINESS | 4.7 | 11442 | 6.8M | 100,000+ | Paid | 4.99 | Everyone | Business | 2018-03-25 | 1.5.2 | 4.0 and up |
| 235 | Tiny Scanner Pro: PDF Doc Scan | BUSINESS | 4.8 | 10295 | 39M | 100,000+ | Paid | 4.99 | Everyone | Business | 2017-04-11 | 3.4.6 | 3.0 and up |
| 427 | Puffin Browser Pro | COMMUNICATION | 4.0 | 18247 | Varies with device | 100,000+ | Paid | 3.99 | Everyone | Communication | 2018-07-05 | 7.5.3.20547 | 4.1 and up |
| 476 | Moco+ - Chat, Meet People | DATING | 4.2 | 1545 | Varies with device | 10,000+ | Paid | 3.99 | Mature 17+ | Dating | 2018-06-19 | 2.6.139 | 4.1 and up |
| 477 | Calculator | DATING | 2.6 | 57 | 6.2M | 1,000+ | Paid | 6.99 | Everyone | Dating | 2017-10-25 | 1.1.6 | 4.0 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10735 | FP VoiceBot | FAMILY | 4.3 | 17 | 157k | 100+ | Paid | 0.99 | Mature 17+ | Entertainment | 2015-11-25 | 1.2 | 2.1 and up |
| 10760 | Fast Tract Diet | HEALTH_AND_FITNESS | 4.4 | 35 | 2.4M | 1,000+ | Paid | 7.99 | Everyone | Health & Fitness | 2018-08-08 | 1.9.3 | 4.2 and up |
| 10782 | Trine 2: Complete Story | GAME | 3.8 | 252 | 11M | 10,000+ | Paid | 16.99 | Teen | Action | 2015-02-27 | 2.22 | 5.0 and up |
| 10785 | sugar, sugar | FAMILY | 4.2 | 1405 | 9.5M | 10,000+ | Paid | 1.20 | Everyone | Puzzle | 2018-06-05 | 2.7 | 2.3 and up |
| 10798 | Word Search Tab 1 FR | FAMILY | 4.3 | 0 | 1020k | 50+ | Paid | 1.04 | Everyone | Puzzle | 2012-02-06 | 1.1 | 3.0 and up |
753 rows × 13 columns
We have successfully converted the datatype of values in the Price column from string to float.
4). Converting the values in the Installs column from string datatype to integer datatype.
s1 = "1,000,00+"
s1
'1,000,00+'
s1 = s1.replace(",","")
s1
'100000+'
s1[ 0: -1]
'100000'
# Checking the contents of the 'Installs' column
ps_df['Installs'].value_counts()
1,000,000+        1416
100,000+          1112
10,000+           1029
10,000,000+        937
1,000+             886
100+               709
5,000,000+         607
500,000+           504
50,000+            468
5,000+             467
10+                384
500+               328
50+                204
50,000,000+        202
100,000,000+       188
5+                  82
1+                  67
500,000,000+        24
1,000,000,000+      20
0+                  14
0                    1
Name: Installs, dtype: int64
To convert all the values in the Installs column from string datatype to integer datatype, we must first drop the '+' symbol and the thousands separators (',') from all the entries where present, and then we can change the datatype.
Applying the convert_plus function to convert the values in the Installs column from string datatype to integer datatype.
# Creating a function convert_plus which drops the '+' symbol and the commas if present and returns an integer.
def convert_plus(val):
    # rstrip('+') removes a trailing '+', replace() removes the thousands separators
    return int(val.rstrip('+').replace(',', ''))
# Applying the convert_plus function to the Installs column
ps_df['Installs'] = ps_df['Installs'].apply(convert_plus)
ps_df.head()
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50000000 | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up |
ps_df['Installs']
0 10000
1 500000
2 5000000
3 50000000
4 100000
...
10836 5000
10837 100
10838 1000
10839 1000
10840 10000000
Name: Installs, Length: 9649, dtype: int64
ps_df['Installs'].dtype
dtype('int64')
The resulting values in the Installs column are of the integer datatype, and each represents the minimum number of times a particular app has been installed.
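The same cleaning can also be done without a row-by-row function. The sketch below is one possible vectorized version using the pandas string accessor; the `installs` Series is a made-up sample in the same format as the Installs column, not the real data.

```python
import pandas as pd

# Hypothetical sample in the same format as the Installs column
installs = pd.Series(["10,000+", "500,000+", "1,000,000,000+", "0+", "0"])

# Remove every ',' and '+' in one pass, then cast the remaining digits to int
installs_clean = (
    installs.str.replace(r"[+,]", "", regex=True)
            .astype("int64")
)

print(installs_clean.tolist())
```

On the real dataframe this would presumably read `ps_df['Installs'].str.replace(...)`, applied while the column is still of string type.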
5). Converting the values in the Size column to the same unit of measure (MB).
ps_df['Size'].value_counts()
Varies with device 1227
12M 181
11M 181
13M 177
14M 176
...
721k 1
430k 1
429k 1
200k 1
619k 1
Name: Size, Length: 457, dtype: int64
We can see that the values in the Size column use different units: 'M' stands for MB and 'k' stands for KB. To analyse this column easily, all the values must be converted to a single unit. In this case, we will convert everything to MB.
Since 1 MB = 1024 KB, a value in KB is converted to MB by dividing it by 1024.
# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.
def convert_kb_to_mb(val):
    try:
        if 'M' in val:
            return float(val[:-1])
        elif 'k' in val:
            return round(float(val[:-1])/1024, 4)
        else:
            # Entries such as 'Varies with device' are returned unchanged
            return val
    except (TypeError, ValueError):
        # Non-string or malformed entries are also returned unchanged
        return val
Applying the convert_kb_to_mb function to convert the values in the Size column to a single unit of measure (MB). The column keeps the object datatype until the 'Varies with device' entries are handled.
# The convert_kb_to_mb function applied to the Size column
ps_df['Size'] = ps_df['Size'].apply(convert_kb_to_mb)
ps_df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000 | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up |
ps_df['Size'].dtype
dtype('O')
ps_df['Size'] = ps_df['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
ps_df['Size'].value_counts()
NaN 1227
12.0 181
11.0 181
13.0 177
14.0 176
...
0.7041 1
0.4199 1
0.4189 1
0.1953 1
0.6045 1
Name: Size, Length: 456, dtype: int64
ps_df['Size'] = ps_df['Size'].apply(lambda x: float(x))
ps_df['Size'].dtype
dtype('float64')
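The three steps above (unit conversion, replacing 'Varies with device', casting to float) could also be folded into a single parser. The following is only a sketch of that idea; `size_to_mb` is a hypothetical helper name, and it assumes every valid size is a number followed by 'M' or 'k'.

```python
import math

def size_to_mb(raw):
    """Convert a size string such as '19M' or '721k' to megabytes.

    Anything else (for example 'Varies with device') becomes NaN.
    """
    text = str(raw).strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return round(float(text[:-1]) / 1024, 4)  # 1 MB = 1024 KB
    except ValueError:
        pass  # malformed number: fall through to NaN
    return math.nan

for sample in ["19M", "8.7M", "721k", "Varies with device"]:
    print(sample, "->", size_to_mb(sample))
```

Because the function always returns a float, applying it to the raw column would give a float64 column directly, with NaN marking the entries that have no fixed size.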
ps_df['Size'].mean(), ps_df['Size'].median()
(20.413555699358753, 12.0)
ps_df['Size'].max()
100.0
ps_df[ps_df['Size'] != 'Varies with device']['Size'].max()
100.0
round(ps_df['Size'].mean(),4)
20.4136
A large number of entries in the Size column (1,227 of 9,649, about 13%) contain the value Varies with device. Since this entry cannot be used for analysis, let's see if it can be imputed with the mean or median value of the entries in this column.
# Finding max, min, mean, and median in the Size column excluding the 'Varies with device' values.
max_size = ps_df['Size'].max()
min_size = ps_df['Size'].min()
mean_size = round(ps_df['Size'].mean(),4)
median_size = ps_df['Size'].median()
[max_size, min_size, mean_size, median_size]
[100.0, 0.0083, 20.4136, 12.0]
Visualization of the distribution of `Size` using distplot and detecting the outliers through a boxplot.
ps_df[['Size']].boxplot()
<AxesSubplot:>
# Distplot
fig, ax = plt.subplots(2,1, figsize=(12,7))
sns.distplot(ps_df[ps_df['Size'] != 'Varies with device']['Size'], color='purple', ax=ax[0])
sns.boxplot(x='Size',data=ps_df, ax=ax[1])
<AxesSubplot:xlabel='Size'>
We also see that a substantial share of the entries in this column (1,227 of 9,649) had the value Varies with device. Replacing them with a measure of central tendency (mean or median) can distort visualizations and results, so any finding that depends on Size should be read with that in mind.
We have successfully converted all the valid entries in the Size column to a single unit of measure (MB) and the datatype from string to float.
# The remaining NaN values are filled with the median, which is less affected by the right skew than the mean
ps_df['Size'] = ps_df['Size'].fillna(ps_df['Size'].median())
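To see how much the median fill shifts the column, we can redo the arithmetic by hand. This is just a back-of-the-envelope check that uses the figures printed above (row count, number of missing entries, mean and median of the known sizes); the variable names are made up for the example.

```python
# Figures taken from the outputs above
n_total = 9649        # rows in ps_df
n_missing = 1227      # former 'Varies with device' entries
mean_known = 20.4136  # mean of the known sizes, in MB
median_known = 12.0   # median of the known sizes, in MB

n_known = n_total - n_missing

# Every filled entry contributes exactly the median to the new total
mean_after = (n_known * mean_known + n_missing * median_known) / n_total

print(n_known)               # 8422
print(round(mean_after, 2))  # 19.34
```

The mean drops from about 20.41 MB to about 19.34 MB, which is the Size mean that ps_df.describe() reports after the fill. The median itself stays at 12 MB.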
6). Converting the datatype of values in the Reviews column from string to int.
# Converting the datatype of the values in the reviews column from string to int
ps_df['Reviews'] = ps_df['Reviews'].astype(int)
ps_df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000 | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up |
ps_df.describe()
| | Rating | Reviews | Size | Installs | Price |
|---|---|---|---|---|---|
| count | 9649.000000 | 9.649000e+03 | 9649.000000 | 9.649000e+03 | 9649.000000 |
| mean | 4.192476 | 2.168145e+05 | 19.343659 | 7.785404e+06 | 1.100079 |
| std | 0.496528 | 1.832255e+06 | 20.589648 | 5.378557e+07 | 16.860857 |
| min | 1.000000 | 0.000000e+00 | 0.008300 | 0.000000e+00 | 0.000000 |
| 25% | 4.000000 | 2.500000e+01 | 5.300000 | 1.000000e+03 | 0.000000 |
| 50% | 4.300000 | 9.690000e+02 | 12.000000 | 1.000000e+05 | 0.000000 |
| 75% | 4.500000 | 2.944500e+04 | 25.000000 | 1.000000e+06 | 0.000000 |
| max | 5.000000 | 7.815831e+07 | 100.000000 | 1.000000e+09 | 400.000000 |
We have successfully converted the datatype of the values in the Reviews column from string to int.
Now that we have handled the errors and NaN values in the Play Store Data.csv file, let's do the same for the User Reviews.csv file.
A pair plot helps identify which features best explain the relationship between two variables or form the most clearly separated clusters. It can also suggest simple classification models, since groups that can be split by a straight line in one of the panels are linearly separable.
Plot a pairwise plot between all the quantitative variables to look for any evident patterns or relationships between the features.
Rating = ps_df['Rating']
Size = ps_df['Size']
Installs = ps_df['Installs']
Reviews = ps_df['Reviews']
Type = ps_df['Type']
Price = ps_df['Price']
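# log1p, i.e. log(1 + x), is used for both skewed count columns so that apps with 0 installs or 0 reviews do not become -inf
p = sns.pairplot(pd.DataFrame(list(zip(Rating, Size, np.log1p(Installs), np.log1p(Reviews), Price, Type)),
                 columns=['Rating','Size', 'Installs', 'Reviews', 'Price','Type']), hue='Type')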
p.fig.suptitle("Pairwise Plot - Rating, Size, Installs, Reviews, Price",x=0.5, y=1.0, fontsize=16)
Text(0.5, 1.0, 'Pairwise Plot - Rating, Size, Installs, Reviews, Price')
FINDINGS
def plot_number_category(data):
    sns.set(style="whitegrid")  # Set the style to whitegrid
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 7)
    ax.set_facecolor('yellow')  # Set background color to yellow
    # Pass the column by keyword: recent seaborn versions treat the first positional argument as data, not x
    sns.countplot(x='Category', data=data, ax=ax, palette='pastel')  # Using Seaborn's pastel color palette
    plt.xticks(rotation=90)
    # Add count values on top of each bar
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center',
                    xytext=(0, 5),
                    textcoords='offset points')
    plt.show()
# ps_df is the DataFrame containing the Play Store data
plot_number_category(ps_df)
sns.set_style("whitegrid", {'axes.grid' : False})
sns.lmplot(y='Rating',x='Size',data=ps_df,col="Category", hue="Category",col_wrap=4,line_kws={'color': 'red'})
<seaborn.axisgrid.FacetGrid at 0x2eb46571a90>
# Get the top 5 categories based on the number of installations
top5_cat = ps_df.groupby('Category')['Installs'].sum().nlargest(5).index.tolist()
# Filter the data for the top 5 categories
data_top5 = ps_df.groupby('Category')['Installs'].sum().loc[top5_cat].reset_index(name='Number_Installations')
# Plotting
plt.figure(figsize=(12, 6)); # Increase figure width
plt.title('Comparing top 5 categories based on Installs', color='white'); # Set title color to white
bar_plot = sns.barplot(y=data_top5['Category'], x=data_top5['Number_Installations'], palette='viridis');
# Annotate each bar with its corresponding value
for index, value in enumerate(data_top5['Number_Installations']):
plt.text(value, index, f'{value:,}', va='center', fontsize=12, color='white') # Set text color to white
plt.xlabel('Number of Installations', color='white'); # Set xlabel color to white
plt.ylabel('Category', color='white'); # Set ylabel color to white
plt.gca().set_facecolor('red'); # Set background color to red
plt.grid(axis='x', linestyle='--', alpha=0.7); # Add gridlines on x-axis
# Adjust x-axis limits to ensure all numbers are visible
plt.xlim(right=data_top5['Number_Installations'].max() * 1.1); # Extend the limit by 10% on the right side
plt.show()
# Grouping by Content Rating and calculating total installations
data_cont = ps_df.groupby('Content Rating')['Installs'].sum().reset_index(name='Number_Installations');
# Plotting
plt.figure(figsize=(10, 5));
plt.title('Total Installations by Content Rating', color='white'); # Set title color to white
bar_plot = sns.barplot(x=data_cont['Content Rating'], y=data_cont['Number_Installations'], palette='viridis');
# Annotate each bar with its corresponding value
for index, value in enumerate(data_cont['Number_Installations']):
plt.text(index, value, f'{value:,}', ha='center', fontsize=12, color='red'); # Set text color to red
plt.xlabel('Content Rating', color='white'); # Set xlabel color to white
plt.ylabel('Number of Installations', color='white'); # Set ylabel color to white
plt.gca().set_facecolor('yellow'); # Set background color to yellow
plt.grid(axis='y', linestyle='--', alpha=0.7); # Add gridlines on y-axis
plt.show()
ps_df.groupby('Content Rating')['Installs'].sum()
Content Rating
Adults only 18+ 2000000
Everyone 52177775851
Everyone 10+ 4016271795
Mature 17+ 2437986878
Teen 16487275393
Unrated 50500
Name: Installs, dtype: int64
# Get the top 5 apps based on the total number of installations
# (many apps share the same install bracket; nlargest keeps the first ones it encounters among ties)
data_app = ps_df.groupby('App')['Installs'].sum().nlargest(5).reset_index(name='Number_Installations')
# Plotting
plt.figure(figsize=(10, 5));
plt.title('Top 5 Apps by Installations', color='blue'); # Set title color to blue
bar_plot = sns.barplot(x=data_app['Number_Installations'], y=data_app['App'], palette='YlGnBu'); # Set palette to Yellow-Green-Blue
# Annotate each bar with its corresponding value
for index, value in enumerate(data_app['Number_Installations']):
plt.text(value, index, f'{value:,}', va='center', fontsize=12, color='black'); # Set text color to black
plt.xlabel('Number of Installations', color='blue'); # Set xlabel color to blue
plt.ylabel('App', color='blue'); # Set ylabel color to blue
plt.gca().set_facecolor('yellow'); # Set background color to yellow
plt.grid(axis='x', linestyle='--', alpha=0.7); # Add gridlines on x-axis
plt.show()
# Reading the userreviews.csv file
ur_df=pd.read_csv(r"E:\0001Almabetter\2.numerical python programming\project-eda-numeric-python\User Reviews.csv")
# Checking the top 5 rows of the data
ur_df.head()
| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.00 | 0.533333 |
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 | 0.288462 |
| 2 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.40 | 0.875000 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.00 | 0.300000 |
ur_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 App 64295 non-null object
 1 Translated_Review 37427 non-null object
 2 Sentiment 37432 non-null object
 3 Sentiment_Polarity 37432 non-null float64
 4 Sentiment_Subjectivity 37432 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB
# Checking shape and column in dataframe
print(ur_df.columns)
rows=ur_df.shape[0]
columns=ur_df.shape[1]
print(f"the no of rows is {rows} and no of columns is {columns}")
Index(['App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity',
'Sentiment_Subjectivity'],
dtype='object')
the no of rows is 64295 and no of columns is 5
Let us first define what information the columns contain based on our inspection.
user_reviews dataframe has 64295 rows and 5 columns. The 5 columns are identified as follows:
def Urinfo():
    temp1=pd.DataFrame(index=ur_df.columns)
    temp1["datatype"]=ur_df.dtypes
    temp1["not null values"]=ur_df.count()
    temp1["null value"]=ur_df.isnull().sum()
    temp1["% of the null value"]=ur_df.isnull().mean().round(4)*100
    temp1["unique count"]=ur_df.nunique()
    return temp1
Urinfo()
| | datatype | not null values | null value | % of the null value | unique count |
|---|---|---|---|---|---|
| App | object | 64295 | 0 | 0.00 | 1074 |
| Translated_Review | object | 37427 | 26868 | 41.79 | 27994 |
| Sentiment | object | 37432 | 26863 | 41.78 | 3 |
| Sentiment_Polarity | float64 | 37432 | 26863 | 41.78 | 5410 |
| Sentiment_Subjectivity | float64 | 37432 | 26863 | 41.78 | 4474 |
Findings
The number of null values are:
# Finding the total no of NaN values in each column.
ur_df.isnull().sum()
App 0
Translated_Review 26868
Sentiment 26863
Sentiment_Polarity 26863
Sentiment_Subjectivity 26863
dtype: int64
There are a lot of NaN values. We need to analyse these values and see how we can handle them.
# Checking the NaN values in the Translated_Review column
ur_df[ur_df['Translated_Review'].isnull()]
| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 2 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 7 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 15 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 102 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 107 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 64290 | Houzz Interior Design Ideas | NaN | NaN | NaN | NaN |
| 64291 | Houzz Interior Design Ideas | NaN | NaN | NaN | NaN |
| 64292 | Houzz Interior Design Ideas | NaN | NaN | NaN | NaN |
| 64293 | Houzz Interior Design Ideas | NaN | NaN | NaN | NaN |
| 64294 | Houzz Interior Design Ideas | NaN | NaN | NaN | NaN |
26868 rows × 5 columns
There are a total of 26868 rows containing NaN values in the Translated_Review column.
We can say that the rows which do not have a review (NaN value instead) also have NaN values in the columns Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity in the majority of the cases.
Let's check if there are any exceptions.
# The rows corresponding to the NaN values in the translated_review column, where the rest of the columns are non null.
ur_df[ur_df['Translated_Review'].isnull() & ur_df['Sentiment'].notna()]
| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 268 | 11st | NaN | Neutral | 0.0 | 0.0 |
| 15048 | Birds Sounds Ringtones & Wallpapers | NaN | Neutral | 0.0 | 0.0 |
| 22092 | Calorie Counter - MyFitnessPal | NaN | Neutral | 0.0 | 0.0 |
| 31623 | DC Comics | NaN | Neutral | 0.0 | 0.0 |
| 52500 | Garden Photo Frames - Garden Photo Editor | NaN | Neutral | 0.0 | 0.0 |
In the few exceptional cases where the remaining columns are non-null although Translated_Review is null, the values appear to be errors: the Sentiment, Sentiment_Polarity and Sentiment_Subjectivity of a review can only be determined if there is a corresponding review.
Hence these rows are invalid and can be dropped altogether.
# Deleting the rows containing NaN values
ur_df = ur_df.dropna()
# The shape of the updated df
ur_df.shape
(37427, 5)
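The reason a plain dropna() is enough here is that its default behaviour (how='any') removes a row as soon as one column is missing, so the five inconsistent rows go away together with the fully empty ones. A small sketch on a made-up miniature of ur_df (the `reviews` frame below is hypothetical) shows this:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of ur_df
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Great app", np.nan, np.nan, "Too many ads"],
    "Sentiment": ["Positive", np.nan, "Neutral", "Negative"],
    "Sentiment_Polarity": [0.8, np.nan, 0.0, -0.5],
})

# Rows that carry a sentiment but no review text (the inconsistent cases)
orphans = reviews[reviews["Translated_Review"].isna() & reviews["Sentiment"].notna()]

# Default how='any': a row is dropped if ANY of its columns is NaN
cleaned = reviews.dropna()

print(len(orphans))
print(cleaned.index.tolist())
```

Only the two rows with a review and a complete sentiment survive, which mirrors the drop from 64295 to 37427 rows above.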
ur_df.iloc[1:22]
| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.250000 | 0.288462 |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.400000 | 0.875000 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.000000 | 0.300000 |
| 5 | 10 Best Foods for You | Best way | Positive | 1.000000 | 0.300000 |
| 6 | 10 Best Foods for You | Amazing | Positive | 0.600000 | 0.900000 |
| 8 | 10 Best Foods for You | Looking forward app, | Neutral | 0.000000 | 0.000000 |
| 9 | 10 Best Foods for You | It helpful site ! It help foods get ! | Neutral | 0.000000 | 0.000000 |
| 10 | 10 Best Foods for You | good you. | Positive | 0.700000 | 0.600000 |
| 11 | 10 Best Foods for You | Useful information The amount spelling errors ... | Positive | 0.200000 | 0.100000 |
| 12 | 10 Best Foods for You | Thank you! Great app!! Add arthritis, eyes, im... | Positive | 0.750000 | 0.875000 |
| 13 | 10 Best Foods for You | Greatest ever Completely awesome maintain heal... | Positive | 0.992188 | 0.866667 |
| 14 | 10 Best Foods for You | Good health...... Good health first priority..... | Positive | 0.550000 | 0.511111 |
| 16 | 10 Best Foods for You | Health It's important world either life . thin... | Positive | 0.450000 | 1.000000 |
| 17 | 10 Best Foods for You | Mrs sunita bhati I thankful developers,to make... | Positive | 0.600000 | 0.666667 |
| 18 | 10 Best Foods for You | Very Useful in diabetes age 30. I need control... | Positive | 0.295000 | 0.100000 |
| 19 | 10 Best Foods for You | One greatest apps. | Positive | 1.000000 | 1.000000 |
| 20 | 10 Best Foods for You | good nice | Positive | 0.650000 | 0.800000 |
| 21 | 10 Best Foods for You | Healthy Really helped | Positive | 0.350000 | 0.350000 |
| 22 | 10 Best Foods for You | God health | Neutral | 0.000000 | 0.000000 |
| 23 | 10 Best Foods for You | HEALTH SHOULD ALWAYS BE TOP PRIORITY. !!. ON M... | Positive | 0.781250 | 0.500000 |
| 24 | 10 Best Foods for You | An excellent A useful | Positive | 0.650000 | 0.500000 |
There are a total of 37427 rows in the updated df.
Hence we have taken care of all the NaN values in the df.
Let's inspect the updated df.
# Inspecting the sentiment column
ur_df['Sentiment'].value_counts()
Positive 23998
Negative 8271
Neutral 5158
Name: Sentiment, dtype: int64
The values in the Sentiment_Polarity and Sentiment_Subjectivity columns look correct.
On the given datasets, we successfully developed a data pipeline. We can now examine this data flow and create user-friendly visuals. It is easy to compare different measures using the visualizations, and thus to draw implications from them.
# Finding correlation between the numeric columns in the play store data
ps_df.select_dtypes(include='number').corr()
| | Rating | Reviews | Size | Installs | Price |
|---|---|---|---|---|---|
| Rating | 1.000000 | 0.050212 | 0.037378 | 0.034306 | -0.018674 |
| Reviews | 0.050212 | 1.000000 | 0.066152 | 0.625158 | -0.007603 |
| Size | 0.037378 | 0.066152 | 1.000000 | 0.030474 | -0.019589 |
| Installs | 0.034306 | 0.625158 | 0.030474 | 1.000000 | -0.009412 |
| Price | -0.018674 | -0.007603 | -0.019589 | -0.009412 | 1.000000 |
# Heat map for play_store
plt.figure(figsize = (20,10))
sns.heatmap(ps_df.select_dtypes(include='number').corr(), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)
Text(0.5, 1.0, 'Correlation Heatmap for Playstore Data')
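To make the coefficients in the table and heatmap above more concrete, here is a minimal hand-written Pearson correlation on toy lists. It is only an illustration of what the number measures; the lists are made up, and `pearson_r` is a hypothetical helper, not what pandas calls internally.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

installs = [1, 2, 3, 4, 5]
reviews_up = [2, 4, 6, 8, 10]     # rises in step with installs
reviews_down = [10, 8, 6, 4, 2]   # falls in step with installs

r_up = pearson_r(installs, reviews_up)
r_down = pearson_r(installs, reviews_down)
print(round(r_up, 2), round(r_down, 2))

# Squaring r gives the share of variance explained by a straight-line fit
r_squared = 0.625158 ** 2         # Reviews vs Installs from the table above
print(round(r_squared, 2))
```

A value close to +1 or -1 means the points lie almost on a straight line, while values near 0 (as for most pairs in the heatmap) mean there is hardly any linear relationship.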
Price is very weakly negatively correlated with Rating, Reviews, and Installs (all three coefficients lie between -0.02 and 0). As the price of an app increases, the average rating, total number of reviews and installs fall only marginally, so price on its own says little about them.
Rating is very weakly positively correlated with the Installs and Reviews columns (about 0.03 and 0.05). This indicates that better rated apps have only slightly more installs and reviews. The only notable relationship is between Reviews and Installs (about 0.63).
merged_df = pd.merge(ps_df, ur_df, on='App', how = "inner")
merged_df.shape
(35929, 17)
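An inner merge keeps only the apps that appear in both dataframes and repeats the app's columns once per matching review, which is why the merged dataframe has 35929 rows. The sketch below shows that behaviour on two made-up frames (`apps` and `revs` are hypothetical). The `validate` argument is an optional safety check added for the example; it is not used in the notebook.

```python
import pandas as pd

# Hypothetical miniatures of ps_df and ur_df
apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["GAME", "TOOLS", "FAMILY"],
})
revs = pd.DataFrame({
    "App": ["A", "A", "B", "D"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# 'C' has no reviews and 'D' has no app record, so both disappear;
# validate raises a MergeError if an app name is duplicated on the left side
merged = pd.merge(apps, revs, on="App", how="inner", validate="one_to_many")

print(merged.shape)
print(merged["App"].tolist())
```

App 'A' shows up twice because it has two reviews. Any per-app statistic computed on the merged frame is therefore weighted by the number of reviews.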
def merged_dfinfo():
    temp = pd.DataFrame(index=merged_df.columns)
    temp['data_type'] = merged_df.dtypes
    temp["count of non null values"] = merged_df.count()
    temp['NaN values'] = merged_df.isnull().sum()
    temp['% NaN values'] =merged_df.isnull().mean()
    temp['unique_count'] = merged_df.nunique()
    return temp
merged_dfinfo()
| | data_type | count of non null values | NaN values | % NaN values | unique_count |
|---|---|---|---|---|---|
| App | object | 35929 | 0 | 0.0 | 816 |
| Category | object | 35929 | 0 | 0.0 | 33 |
| Rating | float64 | 35929 | 0 | 0.0 | 22 |
| Reviews | int32 | 35929 | 0 | 0.0 | 807 |
| Size | float64 | 35929 | 0 | 0.0 | 166 |
| Installs | int64 | 35929 | 0 | 0.0 | 12 |
| Type | object | 35929 | 0 | 0.0 | 2 |
| Price | float64 | 35929 | 0 | 0.0 | 9 |
| Content Rating | object | 35929 | 0 | 0.0 | 5 |
| Genres | object | 35929 | 0 | 0.0 | 67 |
| Last Updated | datetime64[ns] | 35929 | 0 | 0.0 | 247 |
| Current Ver | object | 35929 | 0 | 0.0 | 498 |
| Android Ver | object | 35929 | 0 | 0.0 | 22 |
| Translated_Review | object | 35929 | 0 | 0.0 | 26682 |
| Sentiment | object | 35929 | 0 | 0.0 | 3 |
| Sentiment_Polarity | float64 | 35929 | 0 | 0.0 | 5295 |
| Sentiment_Subjectivity | float64 | 35929 | 0 | 0.0 | 4382 |
merged_df.select_dtypes(include='number').corr()
| | Rating | Reviews | Size | Installs | Price | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|---|---|
| Rating | 1.000000 | 0.075736 | 0.091094 | 0.020145 | -0.010055 | 0.092903 | 0.068758 |
| Reviews | 0.075736 | 1.000000 | 0.190686 | 0.564256 | -0.020591 | -0.080021 | -0.009315 |
| Size | 0.091094 | 0.190686 | 1.000000 | 0.040817 | 0.002484 | -0.118398 | 0.013460 |
| Installs | 0.020145 | 0.564256 | 0.040817 | 1.000000 | -0.025213 | -0.057842 | -0.006307 |
| Price | -0.010055 | -0.020591 | 0.002484 | -0.025213 | 1.000000 | 0.024148 | 0.003182 |
| Sentiment_Polarity | 0.092903 | -0.080021 | -0.118398 | -0.057842 | 0.024148 | 1.000000 | 0.259668 |
| Sentiment_Subjectivity | 0.068758 | -0.009315 | 0.013460 | -0.006307 | 0.003182 | 0.259668 | 1.000000 |
# Correlation heatmap
# Heat Map for the merged data frame
plt.figure(figsize = (15,10))
sns.heatmap(merged_df.select_dtypes(include='number').corr(), annot= True, cmap='Greens')
plt.title(' Heatmap for merged Dataframe', size=20)
Text(0.5, 1.0, ' Heatmap for merged Dataframe')
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])
merged_df.shape
(35929, 17)
merged_df.isna().sum()
App 0
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
Translated_Review 0
Sentiment 0
Sentiment_Polarity 0
Sentiment_Subjectivity 0
dtype: int64
2). What is the ratio of number of Paid apps and Free apps?
data = ps_df['Type'].value_counts()
# Take the labels from the counts so that they always match the slices
labels = data.index
# create pie chart
plt.figure(figsize=(10,10))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()
<matplotlib.legend.Legend at 0x2eb4cbc5fd0>
Findings:
From the above graph we can see that about 92% of the apps in the Google Play Store are free and about 8% are paid.
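The percentages in the pie chart can also be read off as numbers. A minimal sketch, assuming a made-up Type column that happens to have the same 92/8 split (these are not the real counts):

```python
import pandas as pd

# Hypothetical sample: 23 free apps and 2 paid apps
app_types = pd.Series(["Free"] * 23 + ["Paid"] * 2, name="Type")

# normalize=True returns fractions instead of counts
share = app_types.value_counts(normalize=True).mul(100).round(2)

print(share["Free"], share["Paid"])
```

On the real data the same call on ps_df['Type'] should give the exact figures behind the chart.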
ps_df['Content Rating'].unique()
array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
'Adults only 18+', 'Unrated'], dtype=object)
3). Which category of Apps from the Content Rating column are found more on playstore ?
# Content rating of the apps
data = ps_df['Content Rating'].value_counts()
# value_counts() sorts by frequency, so the labels are taken from its index to keep them matched to the slices
labels = data.index
#create pie chart
plt.figure(figsize=(10,10))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['C4', 'r', 'c', 'g', 'm', 'k']
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()
<matplotlib.legend.Legend at 0x2eb4c744580>
A majority of the apps (82%) in the Play Store can be used by everyone. The remaining apps have various age restrictions.
4). Top categories on Google Playstore?
ps_df.groupby("Category")["App"].count().sort_values(ascending= False)
Category
FAMILY 1829
GAME 959
TOOLS 825
BUSINESS 420
MEDICAL 395
PERSONALIZATION 374
PRODUCTIVITY 374
LIFESTYLE 369
FINANCE 345
SPORTS 325
COMMUNICATION 315
HEALTH_AND_FITNESS 288
PHOTOGRAPHY 281
NEWS_AND_MAGAZINES 254
SOCIAL 239
BOOKS_AND_REFERENCE 221
TRAVEL_AND_LOCAL 219
SHOPPING 202
DATING 171
VIDEO_PLAYERS 163
MAPS_AND_NAVIGATION 131
EDUCATION 119
FOOD_AND_DRINK 112
ENTERTAINMENT 102
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 83
WEATHER 79
HOUSE_AND_HOME 74
EVENTS 64
ART_AND_DESIGN 63
PARENTING 60
COMICS 56
BEAUTY 53
Name: App, dtype: int64
# Number of apps belonging to each category in the playstore
cat_counts = ps_df['Category'].value_counts()
plt.figure(figsize=(20,10))
graph = sns.barplot(x = cat_counts.index, y = cat_counts.values, palette= "tab10")
# Categories are on the x-axis and app counts on the y-axis
graph.set_xlabel('App Categories', size=15)
graph.set_ylabel('Number of Apps', size=15)
graph.set_title("Top categories on Playstore", fontsize = 25)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');
Findings:
There are 33 categories in total in the dataset. From the above output we can conclude that most apps on the Play Store fall under the FAMILY and GAME categories, while the fewest are in the BEAUTY and COMICS categories.
# Percentage of apps belonging to each category in the playstore
plt.figure(figsize=(18,18))
plt.pie(ps_df.Category.value_counts(), labels=ps_df.Category.value_counts().index, autopct='%1.2f%%')
my_circle = plt.Circle( (0,0), 0.50, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('% of apps share in each Category', fontsize = 25)
plt.show()
5). Which category of apps has the most installs?
# total app installs in each category of the play store
a = ps_df.groupby(['Category'])['Installs'].sum().sort_values()
a.plot.barh(figsize=(15,10), color = 'c', )
# In a horizontal bar chart the categories are on the y-axis and the totals on the x-axis
plt.xlabel('Total app Installs', fontsize = 15)
plt.ylabel('App Categories', fontsize = 15)
plt.title('Total app installs in each category', fontsize = 20)
Text(0.5, 1.0, 'Total app installs in each category')
Findings:
This tells us which categories of apps have the maximum number of installs. The Game, Communication and Tools categories have the highest number of installs compared to other categories of apps.
6). Average rating of the apps
# Average app ratings
ps_df['Rating'].value_counts().plot.bar(figsize=(20,8), color = 'm' )
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4cd86e20>
We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:
Let's create a new column Rating_group in the main dataframe and apply these filters.
# Defining a function Rating_app to group the ratings as mentioned above
def Rating_app(val):
    '''
    This function categorises a rating from 1 to 5
    as Top rated, Above Average, Average or Below Average
    '''
    # Lower bounds are inclusive so that ratings of exactly 3.0 or 2.0
    # land in their own group instead of falling through to 'Below Average'
    if val >= 4:
        return 'Top rated'
    elif val >= 3:
        return 'Above Average'
    elif val >= 2:
        return 'Average'
    else:
        return 'Below Average'
Let's apply the Rating_app function to the Rating column and save the output in a new column named Rating_group in the main df.
# Applying the Rating_app function
ps_df['Rating_group'] = ps_df['Rating'].apply(Rating_app)
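Binning like this is also what pd.cut is designed for. The sketch below is an alternative, not the notebook's method, and it assumes half-open groups [0, 2), [2, 3), [3, 4) and [4, 5], i.e. a rating of exactly 3.0 counts as 'Above Average'. The `ratings` Series is a made-up sample.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including the boundary values 2.0, 3.0 and 4.0
ratings = pd.Series([1.0, 2.0, 2.9, 3.0, 3.9, 4.0, 5.0])

rating_group = pd.cut(
    ratings,
    bins=[0, 2, 3, 4, np.inf],
    labels=["Below Average", "Average", "Above Average", "Top rated"],
    right=False,  # intervals are closed on the left: [0, 2), [2, 3), [3, 4), [4, inf)
)

print(rating_group.tolist())
```

The result is an ordered categorical column, so value_counts().sort_index() would list the groups from 'Below Average' to 'Top rated' rather than by frequency.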
# Average app ratings
ps_df['Rating_group'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group', fontsize = 12)
plt.ylabel('Number of apps', fontsize = 12)
plt.title('Average app ratings', fontsize = 18)
plt.xticks(rotation=0)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4e8a9a00>
7). What are the Top 10 installed apps in any category?
def findtop10incategory(category):
    # Category names are stored in upper case, e.g. 'GAME'
    category = category.upper()
    top10 = ps_df[ps_df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,6), dpi=100)
    plt.title('Top 10 Installed Apps',size = 20)
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs, palette= "icefire")
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right')
findtop10incategory('GAME')
Findings:
From the above graph we can see that in the Game category, Subway Surfers, Candy Crush Saga and Temple Run 2 have the highest installs. In the same way, by passing different category names to the function, we can get the top 10 installed apps of any category.
8). Top apps that are of free type.
# Creating a df for only free apps
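If we want the numbers behind such a chart, the selection step can be separated from the plotting. A minimal sketch on a made-up frame (`apps` and `top_n_in_category` are hypothetical names) using nlargest:

```python
import pandas as pd

# Hypothetical miniature of ps_df
apps = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": [1000, 50000, 100, 700, 9000],
})

def top_n_in_category(df, category, n=10):
    """Return the n most installed apps of one category as a small table."""
    subset = df[df["Category"] == category.upper()]
    return subset.nlargest(n, "Installs")[["App", "Installs"]]

top_games = top_n_in_category(apps, "game", n=2)
print(top_games)
```

Returning a dataframe instead of drawing directly makes it easy to inspect ties. In the real data many apps share the same install bracket, so the order within a bracket is not meaningful.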
free_df = ps_df[ps_df['Type'] == 'Free']
# Creating a df for top free apps
top_free_df = free_df[free_df['Installs'] == free_df['Installs'].max()]
top10free_apps=top_free_df.nlargest(10, 'Installs', keep='first')
top10free_apps.head(10)
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Rating_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | Google Play Books | BOOKS_AND_REFERENCE | 3.9 | 1433233 | 12.0 | 1000000000 | Free | 0.0 | Teen | Books & Reference | 2018-08-03 | Varies with device | Varies with device | Above Average |
| 335 | Messenger – Text and Video Chat for Free | COMMUNICATION | 4.0 | 56642847 | 12.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 2018-08-01 | Varies with device | Varies with device | Top rated |
| 336 | WhatsApp Messenger | COMMUNICATION | 4.4 | 69119316 | 12.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 2018-08-03 | Varies with device | Varies with device | Top rated |
| 338 | Google Chrome: Fast & Secure | COMMUNICATION | 4.3 | 9642995 | 12.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 2018-08-01 | Varies with device | Varies with device | Top rated |
| 340 | Gmail | COMMUNICATION | 4.3 | 4604324 | 12.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 2018-08-02 | Varies with device | Varies with device | Top rated |
| 341 | Hangouts | COMMUNICATION | 4.0 | 3419249 | 12.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 2018-07-21 | Varies with device | Varies with device | Top rated |
| 391 | Skype - free IM & video calls | COMMUNICATION | 4.1 | 10484169 | 12.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 2018-08-03 | Varies with device | Varies with device | Top rated |
| 865 | Google Play Games | ENTERTAINMENT | 4.3 | 7165362 | 12.0 | 1000000000 | Free | 0.0 | Teen | Entertainment | 2018-07-16 | Varies with device | Varies with device | Top rated |
| 1654 | Subway Surfers | GAME | 4.5 | 27722264 | 76.0 | 1000000000 | Free | 0.0 | Everyone 10+ | Arcade | 2018-07-12 | 1.90.0 | 4.1 and up | Top rated |
| 2544 | Facebook | SOCIAL | 4.1 | 78158306 | 12.0 | 1000000000 | Free | 0.0 | Teen | Social | 2018-08-03 | Varies with device | Varies with device | Top rated |
# Top free apps
top_free_df['App']
152     Google Play Books
335     Messenger – Text and Video Chat for Free
336     WhatsApp Messenger
338     Google Chrome: Fast & Secure
340     Gmail
341     Hangouts
391     Skype - free IM & video calls
865     Google Play Games
1654    Subway Surfers
2544    Facebook
2545    Instagram
2554    Google+
2808    Google Photos
3117    Maps - Navigate & Explore
3127    Google Street View
3234    Google
3454    Google Drive
3665    YouTube
3687    Google Play Movies & TV
3736    Google News
Name: App, dtype: object
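All 20 of these apps share the same install count, so `nlargest(10, 'Installs', keep='first')` simply keeps the first ten rows in index order. The sketch below, on a toy frame with hypothetical values (not the real data), shows one possible way to break such ties by ranking on `Reviews` as a second column.

```python
import pandas as pd

# Toy data (hypothetical values) in which several apps share the same install count.
toy_free = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Installs": [1_000_000_000, 1_000_000_000, 1_000_000_000, 500_000_000],
    "Reviews": [5_000_000, 70_000_000, 20_000_000, 90_000_000],
})

# Rank by Installs first, then use Reviews to order apps with equal installs.
top3 = toy_free.nlargest(3, ["Installs", "Reviews"])
print(top3[["App", "Installs", "Reviews"]])
```

Whether reviews are the right tie-breaker is a judgment call; rating or any other column could be used in the same way.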
# Categories in which the top 20 free apps belong to
top_free_df['Category'].value_counts().plot.bar(figsize=(20,6), color= ('darkcyan','blueviolet'))
plt.xlabel('Category', size=15)
plt.ylabel('Number of apps', size=15)
plt.title('Categories in which the top 20 free apps belong', size=19)
plt.xticks(rotation=45)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f27c5e0>
9). Top apps that are of paid type.
# Creating a df containing only paid apps (.copy() so that new columns can be added safely)
paid_df = ps_df[ps_df['Type'] == 'Paid'].copy()
# Number of apps that can be installed at a particular price
paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (20,6), color = 'crimson')
<AxesSubplot:xlabel='Price'>
The revenue an app generates through installs alone is given by:
Revenue generated through installs = (Number of installs) × (Price to install the app)
Let's define a new column, Revenue, in paid_df that gives the revenue generated by each app through installs alone.
# Creating a new column 'Revenue' in paid_df
paid_df['Revenue'] = paid_df['Installs']*paid_df['Price']
paid_df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Rating_group | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234 | TurboScan: scan documents and receipts in PDF | BUSINESS | 4.7 | 11442 | 6.8 | 100000 | Paid | 4.99 | Everyone | Business | 2018-03-25 | 1.5.2 | 4.0 and up | Top rated | 499000.0 |
| 235 | Tiny Scanner Pro: PDF Doc Scan | BUSINESS | 4.8 | 10295 | 39.0 | 100000 | Paid | 4.99 | Everyone | Business | 2017-04-11 | 3.4.6 | 3.0 and up | Top rated | 499000.0 |
| 427 | Puffin Browser Pro | COMMUNICATION | 4.0 | 18247 | 12.0 | 100000 | Paid | 3.99 | Everyone | Communication | 2018-07-05 | 7.5.3.20547 | 4.1 and up | Top rated | 399000.0 |
| 476 | Moco+ - Chat, Meet People | DATING | 4.2 | 1545 | 12.0 | 10000 | Paid | 3.99 | Mature 17+ | Dating | 2018-06-19 | 2.6.139 | 4.1 and up | Top rated | 39900.0 |
| 477 | Calculator | DATING | 2.6 | 57 | 6.2 | 1000 | Paid | 6.99 | Everyone | Dating | 2017-10-25 | 1.1.6 | 4.0 and up | Average | 6990.0 |
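As a quick check of the revenue formula, here is a minimal sketch on a toy frame. The two paid rows reuse installs and prices from the table above; the free row and the names `toy` and `toy_paid` are made up for illustration. Taking an explicit `.copy()` of the filtered frame before adding a column is one common way to avoid pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Two paid rows copied from the table above plus one hypothetical free row.
toy = pd.DataFrame({
    "App": ["TurboScan", "Puffin Browser Pro", "Some Free App"],
    "Type": ["Paid", "Paid", "Free"],
    "Installs": [100_000, 100_000, 1_000_000],
    "Price": [4.99, 3.99, 0.0],
})

# Filter to paid apps and copy, so the new column is added to an independent frame.
toy_paid = toy[toy["Type"] == "Paid"].copy()
toy_paid["Revenue"] = toy_paid["Installs"] * toy_paid["Price"]
print(toy_paid[["App", "Revenue"]])
```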
# Top app in the paid category
paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Rating_group | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2241 | Minecraft | FAMILY | 4.5 | 2376564 | 12.0 | 10000000 | Paid | 6.99 | Everyone 10+ | Arcade;Action & Adventure | 2018-07-24 | 1.5.2.1 | Varies with device | Top rated | 69900000.0 |
# Top 10 paid apps in the play store
top10paid_apps=paid_df.nlargest(10, 'Revenue', keep='first')
top10paid_apps['App']
2241    Minecraft
5351    I am rich
5356    I Am Rich Premium
4034    Hitman Sniper
7417    Grand Theft Auto: San Andreas
2883    Facetune - For Free
5578    Sleep as Android Unlock
8804    DraStic DS Emulator
4367    I'm Rich - Trump Edition
4362    💎 I'm rich
Name: App, dtype: object
# Categories in which the top 10 paid apps belong to
top10paid_apps['Category'].value_counts().plot.bar(figsize=(15,5), color= ["orange", "red", "green", "blue", "purple"])
plt.xlabel('Category',size=12)
plt.ylabel('Number of apps',size=12)
plt.title('Categories in which the top 10 paid apps belong', size=15)
plt.xticks(rotation=0)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f4c0970>
# Top paid apps according to the revenue generated through installs alone
top10paid_apps.groupby('App')['Revenue'].mean().sort_values(ascending= True).plot.barh(figsize=(16,10), color='darkorange')
plt.xlabel('Revenue Generated (USD)', size=15)
plt.title('Top apps based on revenue generated through installation fee', size=20)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f4c0640>
# Paid app with the highest revenue (the filter below uses 'Revenue', not 'Installs')
paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Rating_group | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2241 | Minecraft | FAMILY | 4.5 | 2376564 | 12.0 | 10000000 | Paid | 6.99 | Everyone 10+ | Arcade;Action & Adventure | 2018-07-24 | 1.5.2.1 | Varies with device | Top rated | 69900000.0 |
10). Distribution of apps based on their size
# Values calculated earlier
[mean_size,median_size,max_size,min_size]
[20.4136, 12.0, 100.0, 0.0083]
The size of an app in our dataset ranges from 0.0083 MB to 100 MB. We can analyse app sizes more easily by grouping them into intervals.
We have already established that the numeric values in the 'Size' column are right-skewed: the mean (20.4 MB) is well above the median (12 MB), so most apps sit at the small end of the range.
We will use the following intervals (in MB):
(< 1, 1-10, 10-20, 20-30, ..., 80-90, 90 and above, 'Varies with device')
Let's create a function to create the size intervals.
# Function to group the apps based on their size in MB
def size_apps(var):
    '''
    This function groups the size of an app
    between ~0 to 100 MB into certain intervals.
    '''
    try:
        if var < 1:
            return 'Below 1'
        elif 1 <= var < 10:
            return '1-10'
        elif 10 <= var < 20:
            return '10-20'
        elif 20 <= var < 30:
            return '20-30'
        elif 30 <= var < 40:
            return '30-40'
        elif 40 <= var < 50:
            return '40-50'
        elif 50 <= var < 60:
            return '50-60'
        elif 60 <= var < 70:
            return '60-70'
        elif 70 <= var < 80:
            return '70-80'
        elif 80 <= var < 90:
            return '80-90'
        else:
            return '90 and above'
    except TypeError:
        # Non-numeric values (e.g. 'Varies with device') are returned unchanged
        return var
Let's apply the size_apps function to the Size column and store the results in a new column named size_group.
ps_df['size_group'] = ps_df['Size'].apply(size_apps)
ps_df.head()
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Rating_group | size_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up | Top rated | 10-20 |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up | Above Average | 10-20 |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up | Top rated | 1-10 |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000 | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up | Top rated | 20-30 |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up | Top rated | 1-10 |
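The same grouping can also be sketched with `pd.cut`, which assigns numeric values to labelled intervals in one call. This is an alternative to the hand-written function, not what the notebook uses. The names `edges`, `labels`, and `sizes` are illustrative, and the sample sizes are taken from values shown earlier in this section.

```python
import numpy as np
import pandas as pd

# Bin edges in MB; right=False makes each interval closed on the left, e.g. [1, 10).
edges = [0, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, np.inf]
labels = ["Below 1", "1-10", "10-20", "20-30", "30-40", "40-50",
          "50-60", "60-70", "70-80", "80-90", "90 and above"]

sizes = pd.Series([0.0083, 8.7, 19.0, 25.0, 100.0])  # sample sizes in MB
size_groups = pd.cut(sizes, bins=edges, labels=labels, right=False)
print(size_groups.tolist())
```

One difference to keep in mind: `pd.cut` needs numeric input and returns NaN for missing values, whereas the function above returns non-numeric entries such as 'Varies with device' unchanged. The result is an ordered categorical, so the groups sort in size order rather than alphabetically.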
# no of apps belonging to each size group
ps_df['size_group'].value_counts().plot.barh(figsize=(20,8),color='r').invert_yaxis()
plt.title("Number of apps in different size groups", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('No of apps', size=15)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f57bb20>
# average no of user reviews in each size group
ps_df.groupby('size_group')['Reviews'].mean().sort_values().plot.barh(figsize=(20,8), color = 'green')
plt.title("Average number of user reviews (in millions)", size=20)
plt.xlabel('Average no of user reviews', size=15)
plt.ylabel('App size in MB', size=15)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f5dcf40>
# average number of app installs in each size group
ps_df.groupby('size_group')['Installs'].mean().sort_values(ascending= False).plot.barh(figsize=(20,8),color='sandybrown').invert_yaxis()
plt.title("Average number of app installs (In 10 millions)", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('Average no of app installs', size=15)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f61e340>
There are a good number of apps whose size varies with the device.
On average, smaller apps have fewer installs and fewer user reviews.
11). Android version based on each category
Now I am going to group the Android versions into major versions 1 through 8. Null values and 'Varies with device' are set to 1.0.
ps_df['Android Ver'].replace(to_replace=['4.4W and up','Varies with device'], value=['4.4','1.0'],inplace=True)
ps_df['Android Ver'].replace({k: '1.0' for k in ['1.0','1.0 and up','1.5 and up','1.6 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '2.0' for k in ['2.0 and up','2.0.1 and up','2.1 and up','2.2 and up','2.2 - 7.1.1','2.3 and up','2.3.3 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '3.0' for k in ['3.0 and up','3.1 and up','3.2 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '4.0' for k in ['4.0 and up','4.0.3 and up','4.0.3 - 7.1.1','4.1 and up','4.1 - 7.1.1','4.2 and up','4.3 and up','4.4','4.4 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '5.0' for k in ['5.0 - 6.0','5.0 - 7.1.1','5.0 - 8.0','5.0 and up','5.1 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '6.0' for k in ['6.0 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '7.0' for k in ['7.0 - 7.1.1','7.0 and up','7.1 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '8.0' for k in ['8.0 and up']},inplace=True)
ps_df['Android Ver'].fillna('1.0', inplace=True)
print(ps_df.groupby('Category')['Android Ver'].value_counts())
Type_cat = ps_df.groupby('Category')['Android Ver'].value_counts().unstack().plot.bar(figsize=(25,8), width=2)
plt.xticks()
plt.show()
Category Android Ver
ART_AND_DESIGN 4.0 50
2.0 9
3.0 2
1.0 1
5.0 1
..
WEATHER 4.0 38
1.0 23
2.0 10
5.0 7
3.0 1
Name: Android Ver, Length: 199, dtype: int64
Findings:
The plot above shows that the majority of apps require Android version 4.0 and up.
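The version grouping above lists every string by hand. A more compact sketch, assuming every version string starts with its major version number, extracts the leading digits with a regular expression and falls back to '1.0' for anything else. The helper name `major_android_version` is hypothetical and not part of the notebook.

```python
import pandas as pd

def major_android_version(versions: pd.Series) -> pd.Series:
    """Map strings such as '4.0.3 and up' to '4.0'; values without a leading digit become '1.0'."""
    # Leading digits of each string; NaN when the string does not start with a digit.
    major = versions.str.extract(r"^(\d+)", expand=False)
    return (major + ".0").fillna("1.0")

sample = pd.Series(["4.4W and up", "2.2 - 7.1.1", "Varies with device", None, "8.0 and up"])
print(major_android_version(sample).tolist())
```

It returns a new Series instead of modifying the column in place, so it would be applied as `ps_df['Android Ver'] = major_android_version(ps_df['Android Ver'])`.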
# Basic inspection
ur_df.columns
Index(['App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity',
'Sentiment_Subjectivity'],
dtype='object')
import matplotlib
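sentiment_counts = ur_df['Sentiment'].value_counts()
counts = list(sentiment_counts)
# Build the labels from the index so each label always matches its count
labels = [f'{sentiment} Reviews' for sentiment in sentiment_counts.index]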
matplotlib.rcParams['font.size'] = 20
matplotlib.rcParams['figure.figsize'] = (10, 15)
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")
plt.title('Percentage of Review Sentiments', fontsize=20)
plt.axis('off')
plt.legend(bbox_to_anchor=(0.9, 0, 0.5, 1))
plt.show()
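The percentages shown in the pie chart can also be computed directly, which makes them easy to quote in the findings. The sketch below uses a small toy Series with hypothetical labels and counts, not the real review data.

```python
import pandas as pd

# Toy sentiment labels (hypothetical counts, not the real review data).
toy_sentiment = pd.Series(
    ["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1, name="Sentiment"
)

# normalize=True returns fractions; multiply by 100 to get percentages.
share = toy_sentiment.value_counts(normalize=True).mul(100).round(2)
pie_labels = [f"{name} Reviews" for name in share.index]
print(dict(zip(pie_labels, share.tolist())))
```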
Findings:
2). Apps with the highest number of positive reviews
# positive reviews
positive_ur_df=ur_df[ur_df['Sentiment']=='Positive']
positive_ur_df
| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.000000 | 0.533333 |
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.250000 | 0.288462 |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.400000 | 0.875000 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.000000 | 0.300000 |
| 5 | 10 Best Foods for You | Best way | Positive | 1.000000 | 0.300000 |
| ... | ... | ... | ... | ... | ... |
| 64217 | Housing-Real Estate & Property | I able set range 1cr, scroll space 0-1cr range... | Positive | 0.233333 | 0.550000 |
| 64221 | Housing-Real Estate & Property | Everything old stuff neither clear sold proper... | Positive | 0.021591 | 0.259470 |
| 64222 | Housing-Real Estate & Property | Most ads older many agents ..not much owner po... | Positive | 0.173333 | 0.486667 |
| 64223 | Housing-Real Estate & Property | If photos posted portal load, fit purpose. I'm... | Positive | 0.225000 | 0.447222 |
| 64227 | Housing-Real Estate & Property | I property business got link SMS happy perform... | Positive | 0.800000 | 1.000000 |
23998 rows × 5 columns
positive_ur_df.groupby('App')['Sentiment'].value_counts().nlargest(10).plot.barh(figsize=(10,8),color='seagreen').invert_yaxis()
plt.title("Top 10 positive review apps")
plt.xlabel('Total number of positive reviews')
plt.legend()
<matplotlib.legend.Legend at 0x2eb50c13730>
3). Apps with the highest number of negative reviews.
negative_ur_df=ur_df[ur_df['Sentiment']=='Negative']
negative_ur_df
| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 32 | 10 Best Foods for You | No recipe book Unable recipe book. | Negative | -0.500000 | 0.500000 |
| 43 | 10 Best Foods for You | Waste time It needs internet time n ask calls ... | Negative | -0.200000 | 0.000000 |
| 68 | 10 Best Foods for You | Faltu plz waste ur time | Negative | -0.200000 | 0.000000 |
| 85 | 10 Best Foods for You | Crap Doesn't work | Negative | -0.800000 | 0.800000 |
| 95 | 10 Best Foods for You | Boring. I thought actually just texts that's i... | Negative | -0.325000 | 0.475000 |
| ... | ... | ... | ... | ... | ... |
| 64215 | Housing-Real Estate & Property | Horrible app. I wanted list property get aroun... | Negative | -0.528571 | 0.717262 |
| 64216 | Housing-Real Estate & Property | Worst app. We get nothing Time waste . They up... | Negative | -0.400000 | 0.250000 |
| 64220 | Housing-Real Estate & Property | No response support team. After I login, unabl... | Negative | -0.377778 | 0.533333 |
| 64226 | Housing-Real Estate & Property | Dumb app, I wanted post property rent give opt... | Negative | -0.287500 | 0.250000 |
| 64230 | Housing-Real Estate & Property | Useless app, I searched flats kondapur, Hydera... | Negative | -0.316667 | 0.400000 |
8271 rows × 5 columns
negative_ur_df.groupby('App')['Sentiment'].value_counts().nlargest(10).plot.barh(figsize=(15,8),color='crimson').invert_yaxis()
plt.title("Top 10 negative review apps")
plt.xlabel('Total number of negative reviews')
plt.legend()
<matplotlib.legend.Legend at 0x2eb4f6eca90>
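The positive and negative rankings above are built from two separately filtered frames. As an alternative sketch, `pd.crosstab` counts every sentiment per app in a single table, from which either ranking can be read. The toy reviews and variable names below are hypothetical.

```python
import pandas as pd

# Toy reviews (hypothetical apps and labels).
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "B", "C"],
    "Sentiment": ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"],
})

# One row per app, one column per sentiment, cells hold review counts.
sentiment_by_app = pd.crosstab(toy_reviews["App"], toy_reviews["Sentiment"])
top_positive = sentiment_by_app["Positive"].nlargest(2)
top_negative = sentiment_by_app["Negative"].nlargest(2)
print(sentiment_by_app)
```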
4). Histogram of Subjectivity
merged_df.Sentiment_Subjectivity.value_counts()
0.000000 4134
1.000000 1653
0.500000 1579
0.600000 1133
0.750000 1095
...
0.508052 1
0.454167 1
0.417316 1
0.765000 1
0.545714 1
Name: Sentiment_Subjectivity, Length: 4382, dtype: int64
plt.figure(figsize=(18,9))
plt.xlabel("Subjectivity")
plt.title("Distribution of Subjectivity")
plt.hist(merged_df[merged_df['Sentiment_Subjectivity'].notnull()]['Sentiment_Subjectivity'])
plt.show()
Findings:
0 - objective (fact), 1 - subjective (opinion)
Most subjectivity scores lie between 0.4 and 0.7. From this we can conclude that most users write their reviews based on their own experience with the app.
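A claim like "most scores lie between 0.4 and 0.7" can be backed by a number as well as a histogram. The sketch below computes the share of non-null scores in that range on toy values, which are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy subjectivity scores (hypothetical), including one missing value.
subjectivity = pd.Series([0.0, 0.3, 0.45, 0.5, 0.6, 0.7, 0.9, 1.0, None])

valid = subjectivity.dropna()
# between() is inclusive at both ends by default; the mean of booleans is the share of True.
share_mid = valid.between(0.4, 0.7).mean()
print(f"{share_mid:.0%} of scores fall in [0.4, 0.7]")
```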
5). Is sentiment_subjectivity proportional to sentiment_polarity?
# scatterplot of sentiment polarity and sentiment subjectivity
plt.figure(figsize=(15, 10))
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=ur_df['Sentiment_Subjectivity'], y=ur_df['Sentiment_Polarity'],
                hue=ur_df['Sentiment'], edgecolor='white', palette="inferno")
plt.title("Google Play Store Reviews Sentiment Analysis", fontsize=20)
plt.show()
From the above scatter plot, we can conclude that sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it shows proportional behavior when the variance is very high or very low.
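One way to check such a relationship numerically is the Pearson correlation. The sketch below uses toy values (hypothetical, not the dataset), chosen so that both strongly positive and strongly negative reviews are highly subjective. In that situation polarity and subjectivity show no linear correlation, while the magnitude of polarity tracks subjectivity closely.

```python
import pandas as pd

# Toy scores (hypothetical): the extremes of polarity are the most subjective.
toy_scores = pd.DataFrame({
    "Sentiment_Polarity": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "Sentiment_Subjectivity": [1.0, 0.5, 0.0, 0.5, 1.0],
})

# Pearson correlation between signed polarity and subjectivity.
corr_signed = toy_scores["Sentiment_Polarity"].corr(toy_scores["Sentiment_Subjectivity"])
# Pearson correlation between the magnitude of polarity and subjectivity.
corr_magnitude = toy_scores["Sentiment_Polarity"].abs().corr(toy_scores["Sentiment_Subjectivity"])
print(corr_signed, corr_magnitude)
```

On the real data, the same two calls on `ur_df` (after dropping null rows) would show whether the pattern seen in the scatter plot holds.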
paid_df['Content Rating'].value_counts().plot.bar(figsize=(10,10),color='c')
plt.legend()
<matplotlib.legend.Legend at 0x2eb5220e670>
free_df['Content Rating'].value_counts().plot.bar(figsize=(10,10),color='blue')
plt.legend()
<matplotlib.legend.Legend at 0x2eb50c18430>
Most apps on the Google Play Store have the content rating 'Everyone', meaning they can be used by everyone. The remaining apps carry various age restrictions.
print(ps_df['Last Updated'].head())
# fetch update year from date
ps_df["Update year"] = ps_df["Last Updated"].dt.year.astype('int64')
0   2018-01-07
1   2018-01-15
2   2018-08-01
3   2018-06-08
4   2018-06-20
Name: Last Updated, dtype: datetime64[ns]
fig, ax = plt.subplots(figsize=(12,6))
sns.regplot(x="Update year", y="Rating", data=ps_df)
plt.title("Update Year VS Rating")
Text(0.5, 1.0, 'Update Year VS Rating')
paid_df["Update year"] = paid_df["Last Updated"].dt.year.astype('int64')
free_df["Update year"] = free_df["Last Updated"].dt.year.astype('int64')
# Label each line so the Free vs. Paid comparison is readable
paid_df.groupby("Update year")["App"].count().plot.line(marker='o', label='Paid', legend=True)
free_df.groupby('Update year')['App'].count().plot.line(marker='o', label='Free', legend=True)
<AxesSubplot:xlabel='Update year'>
In the above plot, we compare the number of free and paid apps updated or added in each year. Before 2011 there were no paid apps. Over the years, many more free apps than paid apps have been added: between 2011 and 2018, the share of free apps among those updated or added rose from 80% to 96%, while the share of paid apps fell from 20% to 4%. We can conclude that most people prefer free apps.
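Percentage shares like these can be derived with a normalized cross-tabulation of update year against app type. The sketch below uses toy rows with hypothetical dates and counts, so its numbers are illustrative and do not reproduce the figures quoted above.

```python
import pandas as pd

# Toy apps (hypothetical): 4 free + 1 paid in 2011, 9 free + 1 paid in 2018.
toy_apps = pd.DataFrame({
    "Type": ["Free"] * 4 + ["Paid"] + ["Free"] * 9 + ["Paid"],
    "Last Updated": pd.to_datetime(["2011-03-01"] * 5 + ["2018-07-15"] * 10),
})
toy_apps["Update year"] = toy_apps["Last Updated"].dt.year

# normalize="index" turns each row (year) into fractions that sum to 1.
type_share = pd.crosstab(toy_apps["Update year"], toy_apps["Type"], normalize="index") * 100
print(type_share.round(1))
```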
paid_df["Update month"] = paid_df["Last Updated"].dt.month.astype('int64')
free_df["Update month"] = free_df["Last Updated"].dt.month.astype('int64')
paid_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color= "green")
plt.title("Paid Apps update over the month", size=20)
plt.legend()
<matplotlib.legend.Legend at 0x2eb4e8cf160>
Like the free apps, most of the paid apps were updated in July.
free_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color='purple')
plt.title("Free Apps update over the month", size=20)
plt.legend()
<matplotlib.legend.Legend at 0x2eb47440730>
In this data, almost 50% of the apps were added or updated in July, 25% in August, and the remaining 25% in the other months.
In this project, we analyzed Play Store applications across several parameters that would help AlmaBetter do well when launching its apps on the Play Store.
In the initial phase, we focused on the problem statements and data cleaning to ensure that our analysis gives the best possible results.
AlmaBetter needs to focus more on:
1.Rating
Most of the apps have a rating between 4 and 5.
The most common rating is 4.3.
App categories have an average rating above 4.
2.Size
Most applications in the dataset are small in size.
3.Installs
The majority of the apps fall into three categories: Family, Game, and Tools.
Although most apps in the Google Play Store fall under Family, Game, and Tools, the picture is different when we look at installs and market demand: the most installed apps fall under Game, Communication, Productivity, and Social.
Subway Surfers, Facebook, Messenger and Google Drive are the most installed apps.
4.Type(Free/Paid)
About 92% of the apps are free and 8% are paid.
The category ‘Family’ has the highest number of paid apps.
Free apps are installed more than paid apps.
The app “I’m Rich - Trump Edition” from the category ‘Lifestyle’ is the most expensive app, priced at $400.
5.Content Rating
Apps rated ‘Everyone’ have the most installs, while ‘Unrated’ and ‘Adults only 18+’ apps have fewer installs.
6.Reviews
The number of installs is positively correlated with the number of reviews (correlation 0.64).
Sentiment Analysis
7.Sentiment
Most of the reviews have Positive sentiment, while Negative and Neutral reviews are far fewer.
8.Sentiment Polarity / Sentiment Subjectivity
The reviews show a wide range of subjectivity, and most fall within the [-0.50, 0.75] polarity range, implying that extremely negative or extremely positive sentiments are rare. Most of the reviews show a mid-range of negative and positive sentiments.
Sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it shows proportional behavior when the variance is very high or very low.
Sentiment Polarity is not highly correlated with Sentiment Subjectivity.